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Abstract 



> 

in 

^^ . I construct a Lanczos process on a large and sparse matrix and use the results of 



this iteration to compute the inverse square root of the same matrix. The algorithm 



Q"^ ■ is a stable version of an earlier proposal by the author. It can be used for problems 

j^ , related to the matrix sign and polar decomposition. The application here comes 



from the theory of chiral fermions on the lattice. 

1 Introduction 

The computation of the inverse square root of a matrix is a special problem in scientific 
computing. It is related to the matrix sign and polar decomposition [I[]. 
One may define the matrix sign by: 

sign(A) = A{A^)-'2 (1) 

where A is a complex matrix with no pure imaginary eigenvalues. 

In polar coordinates, a complex number z = x + i y,is represented by 

z=\z\e'^, = arctan- (2) 

X 



In analogy, the polar decomposition of a matrix A is defined by: 

A = V{A^A)^, V-^ = V^ (3) 

where V is the polar part and the second factor corresponds to the absolute value of A. 

The mathematical literature invloving the matrix sign function traces back to 1971 
when it was used to solve the Lyapunov and algebraic Riccati equations ||I|]. 

In computational physics one may face a similar problem when dealing with Monte 
Carlo simulations of fermion systems, the so-called sign problem @. In this case the 
integration measure is proportional to the determinant of a matrix and the polar decom- 
position may be helpful to monitor the sign of the determinant. 

The example brought in this paper comes from the recent progress in formulating 
Quantum Chromodynamics (QCD) on a lattice with exact chiral symmetry |0. 

In continuum, the massless Dirac propagator Dcont is chirally symmetric, i.e. 

IbDcont + Dcont75 = (4) 

On a regular lattice with spacing a the symmetry is suppressed according to the 
Ginsparg- Wilson relation: [Q: 

75Z) + D75 = aD^.D (5) 

where D is the lattice Dirac operator. 

An explicit example of a Dirac operator obeying this relation is the so-called overlap 
operator |Q|: 

aD = l- A{A^A)-^/\ A = M- aDw (6) 

where M is a shift parameter in the range (0, 2), which I have fixed at one. 
Dw is the Wilson operator, 

11=1 M=l 



which is a nearest-neighbors discretization of the continuum Dirac operator (it violates 
the Ginsparg- Wilson relation). V^ and A^ are first and second order covariant differences 
given by: 






where ipi is a fermion field at the lattice site i and f/^^j an SU{?i) lattice gauge filed 
associated with the oriented link (i,i + fi). These are unitary 3 by 3 complex matrices 
with determinant one. A set of such matrices forms a lattice gauge "configuration". 

7^, /i = 1, . . . ,5 are 4x4 Dirac matrices which anticommute with each-other. 

Therefore, if there are N lattice points in four dimensions, the matrix A is of order 
12N . A restive symmetry of the matrix A that comes from the continuum is the so called 
75 — symmetry which is the Hermiticity of the 75^4 operator. 

By definition, computation of D involves the inverse square root of a matrix. This is 
a non-trivial task for large matrices. Therefore several algorithms have been proposed by 
lattice QCD physicists [|, 0, § |, ITg. 

All these methods rely on matrix-vector multiplications with the sparse Wilson matrix 
Dwi being therefore feasible for large simulations. 

In fact, methods that approximate the inverse square root by Legendre ^ and Cheby- 
shev polynomials [0 need to know apriori the extreme eigenvalues of A'^A to be effective. 
This requires computational resources of at least one Conjugate Gradients (CG) inversion. 

In 1^ the inverse square root is approximated by a rational approximation, which 
allows an efficient computation via a multi-shift CG iteration. Storage here may be an 



obstacle which is remedied by a second CG step ||TT . 

The Pade approximation used by needs the knowledge of the smallest eigenvalue 
of A^A. Therefore the method becomes effective only in connection with the D inversion 
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The method presented earlier by the author |TD[ relies on taking exactly the inverse 
square root from the Ritz values. These are the roots of the Lanczos polynomial approx- 



imating the inverse of AM. 

In that work the Lanczos polynomial was constructed by applying the Hermitian 
operator 75^. The latter is indefinite, thereby responsible for observed oscillations in the 
residual vector norm ||lO| . 

Here I use a Lanczos polynomial on the positive definite matrix A^A. In this case the 
residual vector norm decreases monotonically and leads to a stable method. This is a 
crucial property that allows a reliable stopping criterion that I will present here. 

The paper is selfcontained: in the next section I will briefiy present the Lanczos 
algorithm and set the notations. In section 3, I use the algorithm to solve linear systems, 
and in section 4, the computation of the inverse square root is given. The method is 
tested in section 5 and conclusions are drawn in the end. 

2 The Lanczos Algorithm 

The Lanczos iteration is known to approximate the spectrum of the underlying matrix in 
an optimal way and, in particular, it can be used to solve linear systems |jl3[ . 
Let Qn = [?!, • • • , Qn] be the set of orthonormal vectors, such that 

A^AQ^ = QnTn+Pnqn+l{e^nY, Qi = Pi^ Pi = 1/1 1^1 12 (8) 

where T„ is a tridiagonal and symmetric matrix, b is an arbitrary vector, and /?„ a real 
and positive constant. Cm denotes the unit vector with n elements in the direction m. 

By writing down the above decomposition in terms of the vectors qi,i = 1, . . . ,n and 
the matrix elements of T„, I arrive at a three term recurrence that allows to compute these 



vectors in increasing order, starting from the vector qi. This is the Lanczos Algorithm: 

Po = 0, pi = 1/II6II2, go = 0, qi = pib 

for i = 1, . . . 

V = A^Aqi 

*=«'" (9) 

V ■=v - qiUi - qi-iPi-i 

Pi = \\v\\2 

if Pi < tol, n = i, end for 
Qi+i = v/pi 

where tol is a tolerance which serves as a stopping condition. 

The Lanczos Algorithm constructs a basis for the Krylov subspace |[T3| : 



span{6, A^Ab, ... , {A^A)''-^b} (10) 

If the Algorithm stops after n steps, one says that the associated Krylov subspace is 
invariant. 

In the floating point arithmetic, there is a danger that once the Lanczos Algorithm 
(polynomial) has approximated well some part of the spectrum, the iteration reproduces 
vectors which are rich in that direction |T^. As a consequence, the orthogonality of the 
Lanczos vectors is spoiled with an immediate impact on the history of the iteration: if 
the algorithm would stop after n steps in exact arithmetic, in the presence of round off 
errors the loss of orthogonality would keep the algorithm going on. 

3 The Lanczos Algorithm for solving A^Ax = b 

Here I will use this algorithm to solve linear systems, where the loss of orthogonality will 
not play a role in the sense that I will use a different stopping condition. 
I ask the solution in the form 

X = QuVn (11) 



By projecting the original system on to the Krylov subspace I get: 

QiA^Ax = Qib (12) 

By construction, I have 

fe = Q„eS"Vpi, (13) 

Substituting x = QnUn and using (||), my task is now to solve the system 

TnVn = eS^Vpi (14) 

Therefore the solution is given by 

X = Q„T-iei"Vpi (15) 

This way using the Lanczos iteration one reduces the size of the matrix to be inverted. 
Moreover, since T„ is tridiagonal, one can compute ?/„ by short recurences. 
If I define: 

ri = b~ A^'Axi, qi = piri, Xi = piXi (16) 

where z = 1, . . . , it is easy to show that 

Pi+iPi + Pitti + Pi-iPi-i = 

qt + Xi+iPi + Xiai + Xi_i/3i-.i = 

Therefore the solution can be updated recursively and I have the following Algorithml 



for solving the system A^Ax = b: 

Po = 0, pi = I/II6II2, go = o, qi = pib 
for i = 1, . . . 

V = A^Aqi 
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4 The Lanczos Algorithm for solving (A^A)^/^x = h 

Now I would like to compute x = (A^A)^^''^b and still use the Lanczos Algorithm. In 
order to do so I make the following observations: 

Let (AM)^^/^ be expressed by a matrix- valued function, for example the integral 
formula ^: 

{A^Ay^/^ = - / dt{t^ + A^A)-^ (19) 

T^ Jo 

From the previous section, I use the Lanczos Algorithm to compute 

{A^A)-'b = Q^T-'et^/p, (20) 

It is easy to show that the Lanczos Algorithm is shift-invariant, i.e. if the matrix 
A^A is shifted by a constant say, t^, the Lanczos vectors remain invariant. Moreover, the 
corresponding Lanczos matrix is shifted by the same amount. 

This property allows one to solve the system (t^ + A^A)x = 6 by using the same 
Lanczos iteration as before. Since the matrix (t^ + A'^ A) is better conditioned than A'^ A, 



it can be concluded that once the original system is solved, the shifted one is solved too. 
Therefore I have: 

it' + A^r'b = Qr.it' + T„)-ie["Vpi (21) 

Using the above integral formula and putting everything together, I get: 

x = (AU)-V25 = g.r-i/M"Vpi (22) 

There are some remarks to be made here: 

a) As before, by applying the Lanczos iteration on A'^A, the problem of computing 
{A^A)~^/''b reduces to the problem of computing y^ = Tn e^ /pi which is typically a 

1/2 

much smaller problem than the original one. But since Tn is full, ?/„ cannot be computed 
by short recurences. It can be computed for example by using the full decomposition of 
Tn in its eigenvalues and eigenvectors; in fact this is the method I have employed too, for 
its compactness and the small overhead for moderate n. 

b) The method is not optimal, as it would have been, if one would have applied it 
directly for the matrix (A'^AY'^. By using A'^A the condition is squared, and one looses 
a factor of two compared to the theoretical case! 

c) From the derivation above, it can be concluded that the system (ylM)^'^x = 6 is 
solved at the same time as the system A^ Ax = b. 

d) To implement the result (|22|), I first construct the Lanczos matrix and then compute 



l/n = T„'1/M"Vpi (23) 

To compute x = QnVn, I repeat the Lanczos iteration. I save the scalar products, though 
it is not necessary. 



Therefore I have the following Algorithm2 for solving the system {A'^AY^'^x = b: 

Po = 0, pi = I/II6II2, go = o, qi = pib 
for i = 1, . . . 

V = A^Aqi 

ai = qjv 

v:=v- qiai - gi-iA-i 

A = ll^'lb 

gi+i = v/j3i 



Pi+i 



■'I 



ifr^—; < toL n = i, end for 

(24) 

Set {Tn)i,i = ai, {Tn)i+i,i = {Tn)i,i+i = (3i, otherwise {Tn)i,j = 

go = o, gi = pib, Xq = o 
for i = 1, . . . ,n 

Xi — Xi—i -\- qilJn 

V = A^Aqi 

v:=v- qiai - g^-iA-i 

gi+i = v/l3i 

where by o I denote a vector with zero entries and f/„, A„ the matrices of the eigenvectors 
and eigenvalues of T„. Note that there are only four large vectors necessary to store: 
qi-i,qi,v,Xi. 



5 Testing the method 

I propose a simple test: I solve the system A^ Ax = hhy applying twice the Algorithm2, 
i.e. I solve the linear systems 

(AU)1/2^ = 6, {A^AY'^x = z (25) 

in the above order. For each approximation Xi, I compute the residual vector 

ri = h- A^Axi (26) 

The method is tested for a SU(3) configuration aX (3 = 6.0 on a 8^16 lattice, corre- 
sponding to an order 98304 complex matrix A. 

In Fig.l I show the norm of the residual vector decreasing monotonically. The stag- 
nation of I I^jI I2 for small values of tol may come from the accumulation of round off error 
in the 64-bit precision arithmetic used here. 

This example shows that the tolerance line is above the residual norm line, which 
confirms the expectation that tol is a good stopping condition of the Algorithm2. 

6 Conclusions 

I have presented a Lanczos method to compute the inverse square root of a large and 
sparse positive definite matrix. 

The method is characterized by a residual vector norm that decreases monotonically 
and a consistent stopping condition. This stability should be compared with a similar 
method presented earlier by the author [|10|, where the underlying Hermitian but indefinite 
matrix 75^4 led to appreciable instabilities in the norm of the residual vector. 

In terms of complexity this algorithm requires less operations for the same accuracy 
than its indefinite matrix counterpart. This property is guaranteed by the monotonicity 
of the residual vector norm. Nontheless, the bulk of the work remains the same. 

With the improvement in store the method is complete. 
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It shares with methods presented in [H, M the same underlying Lanczos polynomiaL 



As it is wellknown [|T3| CG and Lanczos methods for solving a linear system produce 
the same results in exact arithmetic. In fact, CG derives from the Lanczos algorithm 
by solving the coupled two-term recurences of CG for a single three-term recurence of 
Lanczos. However, the coupled two-term recurences of CG accumulate less round off. 
This makes CG preferable for ill-conditioned problems. 

There are two main differences between the method presented here and those in P, § : 

a) Since CG and Lanczos are equivalent, they produce the same Lanczos matrix. 
Therefore, any function of A^A translates for both algorithms into a function of T„ (given 
the basis of Lanczos vectors). The latter function translates into a function of the Ritz 
values, the eigenvalues of T„. That is, whenever the methods of papers P, ^ try to 
approximate the inverse square root of AM, the underlying CG algorithm shifts this 
function to the Ritz values. It is clear now that if I take the inverse square root from the 
Ritz values exactly, I don't have any approximation error. This is done in Algorithm2. 

b) Algorithm2 sets no limits in the amount of memory required, whereas the multi- 
shift CG needs to store as many vectors as the number of shifts. For high accuracy 
approximations the multi-shift CG is not practical. However, one may lift this limit in 
expense of a second CG iteration (two-step CG) [|rT|]. Therefore Algorithm2 and the 
two-step CG have the same iteration workload, with Algorithm2 computing exactly the 
inverse square root. 

Additionally, Algorithm2 requires the calculation of Ritz eigenpairs of T„, which makes 
for an overhead proportional to ~ n'^ when using the QR algorithm for the eigenvalues 
and the inverse iteration for the eigenvectors ||13[. Since the complexity of the Lanczos 
algorithm is ~ nN, the relative overhead is proportional to ~ n/N. For moderate gauge 
couplings and lattice sizes this is a small percentage. 

I conclude that the algorithms of p, ^ may be used in situations where a high accuracy 
is not required and/or A is well-conditioned. 

Experience with overlap fermions shows that high accuracy is often essential 0, |10 



In such situations the Algorithm2 is best suited. 
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Figure 1: The dots show the norm of the residual vector, whereas the hne shows the 
tolerance level set by tol in the Algorithm2. 



13 



